Failed to build spark-tensorflow-connector because file already exists
I ran into a problem building spark-tensorflow-connector on GCP Dataproc.
One of the tests fails with:
java.lang.IllegalStateException: LocalPath /tmp/spark-connector-propagate7442350445858279141 already exists. SaveMode: ErrorIfExists
I think the problem is related to this part of LocalWriteSuite.scala:
"Propagate" should {
"write data locally" in {
// Create a dataframe with 2 partitions
val rdd = spark.sparkContext.parallelize(testRows, numSlices = 2)
val df = spark.createDataFrame(rdd, schema)
// Write the partitions onto the local hard drive. Since it is going to be the
// local file system, the partitions will be written in the same directory of the
// same machine.
// In a distributed setting though, two different machines would each hold a single
// partition.
val localPath = Files.createTempDirectory("spark-connector-propagate").toAbsolutePath.toString
// Delete the directory, the default mode is ErrorIfExists
Files.delete(Paths.get(localPath))
df.write.format("tfrecords")
.option("recordType", "Example")
.option("writeLocality", "local")
.save(localPath)
// Read again this directory, this time using the Hadoop file readers, it should
// return the same data.
// This only works in this test and does not hold in general, because the partitions
// will be written on the workers. Everything runs locally for tests.
val df2 = spark.read.format("tfrecords").option("recordType", "Example")
.load(localPath).sort("id").select("id", "IntegerTypeLabel", "LongTypeLabel",
"FloatTypeLabel", "DoubleTypeLabel", "VectorLabel", "name") // Correct column order.
assert(df2.collect().toSeq === testRows.toSeq)
}
}
}
If I understand correctly, the dataset has two partitions, and both partition writers seem to try to write to the same local path.
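One thing I experimented with to sidestep the collision is relaxing the save mode. This is a minimal sketch, assuming the connector honors Spark's standard SaveMode (the exception message mentions "SaveMode: ErrorIfExists", which suggests it checks it); it is my own workaround attempt, not the project's fix:

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// Two partitions, as in the test, so two local writer tasks touch the same directory.
val df = spark.sparkContext.parallelize(Seq(1L, 2L), numSlices = 2).toDF("id")

val localPath = Files.createTempDirectory("spark-connector-propagate")
  .toAbsolutePath.toString
Files.delete(Paths.get(localPath))

df.write.format("tfrecords")
  .option("recordType", "Example")
  .option("writeLocality", "local")
  .mode(SaveMode.Overwrite) // relax the default ErrorIfExists; assumes local writes support it
  .save(localPath)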
Has anyone run into and resolved this, or am I missing a step?
Note that I posted a similar question on GitHub.
Given that this is a very valuable package and that many people have installed spark-tensorflow-connector successfully, I feel like I'm missing a step:
I did not build TensorFlow Hadoop as a Maven dependency, even though it is explicitly defined in Step 3.
However, when building TensorFlow Hadoop I had to use an additional command, export _JAVA_OPTIONS=-Djdk.net.URLClassPath.disableClassPathURLCheck=true, as suggested by Michael in "Maven surefire can't find ForkedBooter class".
Edit: the problem persists on Dataproc.
Solution:
After some research, I downloaded the latest version of spark-tensorflow-connector directly and installed it following the directions posted on Maven. I did not have to install TensorFlow Hadoop as suggested in the TensorFlow Ecosystem. Note that I was able to install the jar file on my Dataproc cluster.
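For anyone landing here, this is roughly how I use the connector after installing the jar; a minimal sketch, where the gs:// path is a placeholder of my own rather than anything from the build instructions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")

// Write TFRecords with the connector, then read them back.
df.write.format("tfrecords")
  .option("recordType", "Example")
  .save("gs://my-bucket/tfrecords-demo") // placeholder output path

val df2 = spark.read.format("tfrecords")
  .option("recordType", "Example")
  .load("gs://my-bucket/tfrecords-demo")
df2.show()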